Assessment of genetic relationships among native and introduced Himalayan balsam (Impatiens glandulifera) plants based on genome profiling

Abstract We conducted genomic characterization based on SNP and SilicoDArT markers on the invasive Himalayan balsam (Impatiens glandulifera) plants originating from native and non‐native regions of their distribution. When genetic relationships were explored by PCoA using SNP and SilicoDArT marker data, the first, second, and third principal coordinates explained altogether 37.4% and 31.0% of the variability, respectively. Samples from the UK, Canada, and Pakistan were grouped together, while Indian plants were clearly distinct based on SNP markers but relatively close to the UK–Canada–Pakistan group based on SilicoDArT markers. Constructed trees differentiated individuals into clusters resembling the PCoA patterns. The Bayesian BAPS analysis performed for the SNP data revealed that the individuals were distributed in seven clusters, representing samples from each of the four Finnish populations, India, Pakistan, and the combination of the UK and Canada. Similar clustering was visible in the UPGMA tree. The Indian cluster did not display any ancestral gene flow with the others, while the Pakistani cluster showed ancestral gene flow only with the combined UK and Canada cluster. Furthermore, the latter cluster displayed ancestral gene flow with the Finnish populations varying from 0% to 3.1%. The BAPS analyses conducted for the SilicoDArT data differ slightly: The individuals were distributed in nine clusters, and the Indian cluster exhibited ancestral gene flow with the mixed cluster including Canadian, Pakistani, and UK samples, and one Finnish sample. The AMOVA showed that 45% and 26% of variation was present among the I. glandulifera groups/populations and the rest within them based on SNP and SilicoDArT markers, respectively. The Bayesian BAPS analyses and the gene flow networks were the most informative tools for resolving relationships among native and introduced plants. It is notable that the small sample sizes for non‐Finnish plant materials may affect the accuracy of the gene flow and other estimates.


| INTRODUC TI ON
Nonindigenous species are species distributed outside their historic and native range. Their dispersal may occur either intentionally or accidentally, often being promoted by human activities, such as agriculture, horticulture, aquaculture, transportation, and recreation (Kolar & Lodge, 2001). In certain cases, instead of remaining localized within their new environment, nonindigenous species can become invasive and have large and long-lasting impacts on the region and its ecosystems, as they increase in number and expand their geographic range. Invasive species also provide interesting opportunities to investigate evolutionary processes (Huey et al., 2005). At introduction, they typically contain just a portion of the species' gene pool, possibly experiencing a low level of genetic variation and a genetic bottleneck effect. Yet, despite their bottlenecked populations and low evolutionary potential, many invasive species manage to thrive and expand in new environments (Dlugosch & Parker, 2008;Helsen et al., 2019), possibly as a result of the temporal buildup of genetic diversity through multiple introduction events (Helsen et al., 2019) or through phenotypic plasticity facilitating their spread (Skalova et al., 2012).
The Himalayan balsam, Impatiens glandulifera Royle (Balsaminaceae), is a tall, self-pollinated annual plant, which has been introduced widely as a garden ornamental, first to England in 1839 from its native distribution area in Kashmir in the Himalayas (Beerling & Perrins, 1993). Yet, there is some dispute over the date of introduction, as Tanner (2011) has proposed that I. glandulifera was introduced somewhat prior to 1839, and Morgan (2007) states that the species was first introduced into England from Nepal.
Later, due to its popularity as an ornamental plant, high seed production, rapid growth rate, and ability to survive frost, I. glandulifera has spread throughout Europe and it is invasive in parts of Canada, the United States, and New Zealand, with limited occurrence in Japan (Beerling & Perrins, 1993;Cockel & Tanner, 2012;Perrins et al., 1993;Weber, 2003). It has tendency to form dense monocultures and it is a strong competitor with high plastic responses in regions of introduction (Hulme & Bremner, 2006;Perrins et al., 1993;Skalova et al., 2012). Additionally, Elst et al. (2016) have suggested that differences in habitats between the native and invasive range, especially the higher nutrient availability observed in the new environment, are among main factors driving the invasion of I. glandulifera. It has become clear that the species represents a significant threat to native ecosystems in many temperate areas of the world (Tanner & Gange, 2020).
Earlier DNA-based population genetic studies on I. glandulifera have involved the use of microsatellite markers in investigations conducted in the UK (Provan et al., 2007;Walker et al., 2009).
Later, based on microsatellite markers and the sequencing of the nuclear ITS region, Nagy and Korpelainen (2015) compared native and introduced populations of I. glandulifera and discovered distinct population genetic patterns indicating the strong effect of humanmediated dispersal. Comparably, Hagenblad et al. (2015) used microsatellites to investigate I. glandulifera samples from the species' native range in India and from the introduced range within Europe.
They also showed that human activities, at least partially, have facilitated not only introductions, but also further spread of the species across Europe. During recent years, genetic investigations on I. glandulifera on both native and introduced populations have involved the use of chloroplast DNA, including even the analysis of a complete chloroplast genome from a 125-year-old herbarium specimen (Cafa et al., 2020;Kurose et al., 2020). Such studies show that maternal cpDNA provides genetic markers for population studies that could be linked to evolutionary history and phylogeography. Although these markers have been valuable for discovering differentiation and other population genetic processes among I. glandulifera plants from native and introduced areas, their precision and ability to resolve differences and patterns in genetic structures have not been optimal.
During recent years, high-throughput genotyping-by-sequencing (GBS) analyses generating large numbers of markers have become an increasingly frequent approach for molecular characterization and for studies on genetic variation and differentiation (e.g., Ball et al., 2020;Villa-Machío et al., 2020).
In this study, we employed GBS using both SNP and SilicoDArT marker techniques to perform a comprehensive genome-wide analysis of the genetic structure on I. glandulifera. Our aim was to discover two large sets of DNA markers and, by using them, to examine and compare patterns of genetic variability in I. glandulifera both in the native distribution range and in the area of introduction. We hypothesized that plants in the area of first introduction (England) show a closer genetic relationship to the plants from the native area of distribution (the Himalayas) than those from other regions of introduction, and plants in the areas of introduction show considerable differentiation due to genetic drift and anthropogenic effects, for example, seed dispersal commercially and by home gardeners.
Additionally, we hypothesized that the two marker types provide similar results of genetic relationships and differentiation patterns.

| Sampling and DNA extraction
The study material of I. glandulifera representing its native area of ing from populations FI-1, FI-2, FI-3, and FI-4), a total of 94 samples (Appendix S1, see also Nagy & Korpelainen, 2015). Leaf samples from populations in Finland were placed directly into Eppendorf tubes, let dry in open tubes, and used for DNA extractions within a week, while other samples were squashed onto Whatman FTA paper, let dry in air, and then stored at room temperature for several months before being used for DNA extractions. All DNA extractions were conducted using the E.Z.N.A. Plant DNA Mini Kit Spin Protocol (Omega Bio-Tek). All DNA concentrations were measured using NanoDrop Spectrophotometer (Thermo Scientific) and adjusted to 20 ng/µl, with a minimum volume of 10 µl. Then, the DNA samples were placed into a fully skirted 96-well PCR plate, packed, and shipped for genotyping.

| Genotyping by sequencing and quality filtering
Genotyping-by-sequencing analysis of the 94 I. glandulifera samples was conducted using a genome profiling service for SNP and SilicoDArT markers by Diversity Arrays Technology Pty Ltd.
(Canberra, Australia), following the DArT genotyping protocol of Kilian et al. (2012). In this protocol, DNA samples were exposed to digestion-ligation reactions using restriction enzymes, namely Pst1 in combination with Sphl, with the addition of barcoded adaptors corresponding to the overhangs of the two restriction enzymes. The resulting fragments were amplified by PCR, and the amplicons from each sample were pooled and exposed to cBot (Illumina) bridge PCR and then sequenced using Illumina. The algorithm DArTsoft14 was used to extract both SilicoDArT and SNP markers. Genotyping of ten samples failed due to poor DNA quality (two from India, one from Pakistan, and seven from Canada (Appendix S1). All failed samples were among those stored by squashing onto Whatman FTA paper.
The quality of both SNP and SilicoDArT markers was assessed by different quality parameters, such as call rate and reproducibility percentages. SNP markers were first filtered for all secondary (multiple loci within a fragment that are likely to be linked) and monomorphic loci. Call rate and reproducibility were filtered for both types of markers to the threshold of 0.95. All SNP loci were filtered for deviations from HWE, but no loci were deleted upon filtering. The frequency distribution of polymorphism information content (PIC) values was computed for both marker types after filtering. Filtered data were used for subsequent diversity and genetic structuring analyses. Data filtering was performed using the R 4.0.2 (R Core Team, 2020) package dartR (Gruber et al., 2018).

| Genetic diversity and population structure
Genetic diversity indices were calculated using R 4.0.2 (R Core Team, 2020) package dartR (Gruber et al., 2018) for filtered data.
An exception was the only remaining Canadian sample, which was excluded from geographic group/population diversity analyses.  (Nei, 1987).
In addition, the fixation index (population differentiation) (F ST ) and corrected F ST (F STP ), as well as the inbreeding coefficient (F IS ), were computed (Nei, 1987).
Principal coordinate analyses (PCoA) were applied to investigate genetic relationships among individuals from different groups/populations using R 4.0.2 (R Core Team, 2020) package dartR (Gruber et al., 2018). In addition, Euclidean distance matrices were generated and the corresponding unrooted trees were constructed based on both marker sets using R 4.0.2 (R Core Team, 2020) package dartR (Gruber et al., 2018). A correlation between SNP and SilicoDArT distance matrices was determined by the Mantel test (Mantel, 1967) using the same software.
Population structuring based on both marker types was assessed by the program BAPS 6.0 (Corander et al., 2013;Tang et al., 2009), which uses Bayesian methods to discover population structuring. An admixture analysis based on the mixture clustering of groups of individuals was chosen to estimate the K value that best explains the distribution of the individual samples into different genetic clusters.
The analysis was conducted by performing 50 iterations of K (from 2 to 20). The UPGMA trees were constructed based on the Kullback-Leibler divergence matrices that were produced as outputs of the BAPS analyses. Based on the admixture results, the Plot Gene Flow function of the BAPS software was used to estimate and illustrate a network of clusters at the best explaining K value.
We used GenAlEx 6.5 (Peakall & Smouse, 2012)   Yet, a significant positive correlation was found (r = .644, p < .001) between SNP and SilicoDArT distance matrices, as determined by a Mantel test, which proved a good fit between SilicoDArT and SNP marker data sets.

| RE SULTS
Genetic relationships among the I. glandulifera genotypes were assessed also using the Bayesian BAPS analysis, which revealed that based on SNP markers the individuals were distributed in seven clusters (Figure 3a), including clusters Finland-1, Finland-2, Finland-3, Finland-4, UK, and Canada, India, and Pakistan. Similar clustering is also visible in the UPGMA tree constructed based on the divergence matrix, provided as an output of the BAPS analysis The individuals were distributed in nine clusters, as two of the four Finnish populations were split into two clusters and the UK samples were split into two clusters, while the Canadian, Pakistani, and most UK samples, and one Finnish sample from population TAH (FI-3) formed a mixed cluster (Figure 3b; Appendix S4B). In addition, the Indian cluster exhibited ancestral gene flow (6.6%) with the mixed F I G U R E 4 A gene flow network identified for the seven clusters (K = 7) of Impatiens glandulifera by BAPS based on SNP markers (a), and for the nine clusters (K = 9) based on SilicoDArT markers (b). In parentheses, the numbers of samples in the cluster if the geographic group/population was present in more than one cluster. Gene flow is shown by weighted arrows, which indicate relative average amounts of ancestry coming from the source cluster but present now among individuals assigned to the target cluster cluster unlike in the results obtained from SNP-based analyses, and the cluster composed of two UK samples showed no ancestral gene flow. It is notable that the small sample sizes for non-Finnish plant materials may affect the accuracy of the gene flow and other group/ population-based estimates.
The AMOVA showed that 45% and 26% of genetic variation lies among the seven groups/populations (p < .001; the only Canadian sample excluded) and the rest within them based on SNP and SilicoDArT markers, respectively, which indicates a high degree of differentiation among groups/populations, especially based on SNP marker data (Table 1)

| D ISCUSS I ON
Our study shows the suitability of SNP and SilicoDArT markers for investigations on I. glandulifera. Filtering sequencing data retained 937 SNPs out of original 29,625 markers and 11,391 SilicoDArT markers out of 21,493 markers. Such large numbers of markers when compared with previous microsatellite-, cpDNA-, or ITSbased studies (e.g., Cafa et al., 2020;Hagenblad et al., 2015;Kurose et al., 2020;Nagy & Korpelainen, 2015;Provan et al., 2007) provide a good opportunity to obtain precise knowledge of the genetic relationships among plants originating from different geographic regions. As a by-product of the present genotyping-by-sequencing project on I. glandulifera, we have developed in silico 259 microsatellite markers but those have not been tested otherwise (Korpelainen & Pietiläinen, 2020).
The informativeness of the SNP and SilicoDArT markers was as-  Studying the population structure is a key point for understanding patterns of gene flow and for inferring the history of populations.
The Bayesian clustering approaches implemented, for example, in BAPS (Corander et al., 2013), are effective and reliable methods for revealing phylogenetic relationships, and population structures. In   Additionally, we had hypothesized that the two marker types provide similar results of genetic relationships and differentiation patterns. Overall, this was true. Some detected deviations may at least partly relate to the small sample sizes for non-Finnish plant materials.

ACK N OWLED G M ENTS
We thank Maria Pietiläinen for help in laboratory work and the University of Helsinki for financial support.

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
Genotyping data are archived in Dryad https://doi.org/10.5061/ dryad.dv41n s1xn.